# High-Precision Visual Question Answering

Videorefer 7B Stage2.5

VideoRefer-7B is a multimodal model built on a video large language model and focused on spatio-temporal object understanding.

Transformers English

Llama 3.2V 11B Cot

Llama-3.2V-11B-cot is a vision-language model capable of spontaneous, systematic reasoning, built on the LLaVA-CoT framework.

Transformers English

Xgen Mm Phi3 Mini Instruct Singleimg R V1.5

xGen-MM is a series of foundational large multimodal models developed by Salesforce AI Research. It builds on the design of the BLIP series to provide stronger multimodal processing.

Safetensors English

Internlm Xcomposer2 Vl 7b

InternLM-XComposer2 is a large vision-language model built on InternLM2, with strong image-text comprehension and composition capabilities.
